{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evasion and Hardening of Tree Ensemble Classifiers\n",
    "\n",
    "> This project tries to implement the idea mentioned in the [paper](https://arxiv.org/pdf/1509.07892.pdf) which is to find finding for a given instance `x` the “nearest” instance `x_prime` such that the classifier predictions of `x` and `x_prime` are different."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Currently we have looked at Random Forest Classifiers ( scikit-learn ) implementation but the idea can easily be extended to GBDT. Currently the package implements the `Approximate Evasion` method implemented in the paper. For experimental evaluation we look at `MNIST` digit classification dataset with only two categories `2` and `6`. The paper also chooses this dataset because it is well studied datasets, one-to-one mapping between pixels and features and features can vary independently of each other. We can pictorially represent evading instances, and this helps understanding the models’ robustness or lack of."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How to use"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "from tree_evasion.core import *\n",
    "from tree_evasion.tree import *\n",
    "from tree_evasion.symbolic_prediction import *\n",
    "\n",
    "Xtr, Xva, Xte, ytr, yva, yte = get_mnist_dataset(SEED)\n",
    "\n",
    "clf = RandomForestClassifier(n_estimators=5, max_depth=4, random_state=SEED, n_jobs=-1)\n",
    "clf.fit(Xtr, ytr)\n",
    "\n",
    "# performance on the holdout set\n",
    "print(clf.score(Xva, yva))\n",
    "\n",
    "pairs = CoordinateDescent.get_pairs(clf, Xte)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example we want to change only one pixel for an instance and see if the classifier prediction changes or not. [notebook](02_SymbolicInstance.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/example_1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/example_2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/example_3.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Applications"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the paper mentions, we expect a high performance learning algorithm to generalize well and be hard to evade: only a “large enough” perturbation δ should be able to alter its decision. The existence of small-δ evading instances shows a defect in the generalization ability of the model, and hints at improper model class and/or insufficient regularization. \n",
    "\n",
    "Since machine learning systems are being deployed in many security-oriented applications. In these applications attacker has a large incentive to find evading instances and fool the system. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next Steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- [ ] We have only tackled the `Approximate Evasion`, paper mentions another algorithm which uses Mixed Integer Linear Programming.\n",
    "- [ ] After finding good evading instances, author suggests to include them in the training set and retrain. This process is called `Adversarial Boosting`."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
